Abstract

Training a named entity recognizer (NER) has always been a difficult
task because of the effort required to generate a significant amount of
annotated training data. In this paper, we reduce or eliminate that
effort by automatically converting other sources of data into annotated
training data. We test the approach on a gene-protein name extractor
using the mouse and fly data from the BioCreAtIvE challenge. Results
show that our methods are effective and that our trained NER system
outperforms all of our baseline systems.


1  Introduction 

Many prior research papers on biological text-mining have developed
machine-learned named entity recognition (NER) systems to identify
substrings in biomedical publications that correspond to gene and
protein names, usually without distinguishing between them [4, 9, 11,
16]. These NER systems are often trained on large amounts of manually
annotated training examples, consisting of documents with the position
of every named entity marked. This training data is difficult to
produce.

Training data for gene-protein entities is especially difficult to
produce because labeling documents requires expertise in biology.
Although a number of corpora have been annotated, the documents in
these corpora are drawn from specific sub-areas of biology. Here we
consider two such corpora: the YAPEX* [10] training corpus, which
consists of Medline abstracts selected as likely to contain information
about protein-protein interactions; and the GENIA [7] corpus, which
contains abstracts likely to contain information about cell signaling
in human blood cells. As we will show, extractors trained on these
corpora appear to be distribution-specific (i.e., they do not transfer
well to other sub-areas of biology, or to different genres of text
within the same sub-area).

The distribution-specificity of learned NER systems makes it difficult
to use them in certain types of text-mining systems. As an example,
consider the SLIF system [23], which mines full-text biomedical
publications for information about subcellular localization of
proteins. More specifically, SLIF finds figures containing images of a
certain sort (fluorescence microscope images depicting protein
localization), and then collects, analyzes, and indexes these figures
by the proteins depicted. For this application it is necessary to apply
NER methods to figure captions; however, the majority of NER training
sets are annotated abstracts.

Motivated by such problems, this paper explores several approaches for
training a gene-protein NER system with data sources that are often
easier to obtain. The first source is NER annotations for a related,
but slightly different corpus: this reflects the common practice of
applying a learned NER system to documents that are drawn


*  Available  from  http://www.sics.se/humle/projects/prothalt 
Dataset          Mouse                                   Fly
Data             Eval.     Weak-train     Curated        Eval.     Weak-train   Curated
# of Abstracts   50        200            1000           51        57           1000
Abstract IDs     100-149   1-99, 150-250  4000-4999      [1-298]   [308-494]    4000-4999

Table 1. Distribution of abstracts among the evaluation, weak-train,
and curated data for the mouse and fly datasets. ID ranges enclosed in
brackets indicate that only a subset of the IDs in that range was used.


from a slightly different distribution. The second source is a synonym
list: a list of gene identifiers together with synonyms for each
identifier. The third source is weak labels, which associate a document
with identifiers for each gene-protein entity that appears in the
document. Weak labels for text can often be obtained automatically by
analyzing databases of information extracted from text. Specifically,
weak labels are often obtainable for biomedical documents by analysis
of manually curated biological knowledge bases such as FlyBase [1] and
MGI [2].

One prior experimental study that exploits synonym lists and weak
labels is BioCreAtIvE task 1B [14, 15], which collected common test-bed
problems and a common evaluation framework for determining the database
identifier of every gene mentioned in biomedical abstracts, a task
closely related to NER, but distinct. Three separate test-bed problems
were developed, one for each of three model organisms: yeast, fly, and
mouse.

In this paper, we use only the mouse and fly datasets, which were the
two hardest for the BioCreAtIvE participants, to train a gene-protein
NER system. Performance on NER is evaluated by testing on a small
subset of the BioCreAtIvE test set that was manually annotated. We
compare weakly-learned NER systems with results for four baseline
systems. The first baseline is a dictionary-based extractor, which
soft-matches words from a synonym list to a corpus. The second, third,
and fourth baselines are machine-learned NER systems trained on the
GENIA dataset, the YAPEX dataset, and small corpora of
conventionally-labeled documents from the BioCreAtIvE datasets.

Experimentally, we show that no baseline system performs well on the
evaluation data: the best baseline F1 measures reach only 57% on the
mouse data, and 41% on the fly data. We then present results for
several alternative approaches that use weak labels, and demonstrate
that much better performance can be obtained with weakly-trained NER
systems.

Our approach for weak-label learning consists of four steps. First, we
look up, for each abstract, its associated gene identifiers and we
label all possible locations of synonyms associated with these
identifiers in that same abstract. Second, we train extractors on these
weakly labeled abstracts, using word features such as string similarity
to synonyms [24]. We also investigated a pre-processing step that
removes from the training set sentences not containing any weak labels,
and a post-processing step that exploits intra-document repetition of
names [20] by soft-matching every instance of an extracted name against
the document in which it occurs, and classifying every such soft-match
as a protein name. To further evaluate our weak-label learning
approach, we also present results for NER systems tuned for either
precision or recall [21]. Our results show that the quality of an NER
system can be improved through the use of readily available
weakly-labeled data.

We use datasets from BioCreAtIvE task 1B, specifically the mouse and
fly datasets, which were drawn from MGI and FlyBase respectively. For
each dataset, we constructed three corpora for our experiments:
evaluation, weak-train, and curated. The evaluation and weak-train data
are subsets of the BioCreAtIvE "devtest" set, and curated data is a
subset of the "training" set. Table 1 summarizes, for both datasets,
the size of each corpus, and also lists the specific abstracts (by
BioCreAtIvE ID) that were used to form it. In curated, each abstract is
associated with gene identifiers of all genes that are mentioned in the
full text of the abstract. However, in weak-train, each abstract is
only associated with identifiers of some genes mentioned in the
abstract. Hence, curated is noisier than weak-train. We also use the
synonym lists provided by the mouse and fly datasets, which contain
associations between synonyms and unique gene identifiers. The list for
the mouse dataset consists of 183,142 synonyms for 52,594 identifiers,
and the list for the fly dataset consists of 135,471 synonyms for
35,970 identifiers. To evaluate our NER systems, the abstracts in the
evaluation data were manually annotated with gene-protein entity names.

2  Baseline  Methods 

2.1 Global Edit Distance

In order to train a gene-protein NER system using a synonym list, we
devise a feature that indicates how similar each word in the abstracts
is to the most similar word in the entire (global) synonym list. The
similarity measure incorporates Levenshtein distance [19], and thus we
call this the global edit distance (GED) feature. Elsewhere it has been
shown that features of this sort can substantially improve NER
performance [24].

More specifically, GED case-insensitively calculates a similarity score
between two strings, s and s', as:

$$\mathrm{SimScore}(s, s') = 1 - \frac{\mathrm{LD}(s, s')}{\max(\mathrm{length}(s), \mathrm{length}(s'))} \qquad (1)$$

where LD(s, s') is the Levenshtein distance between strings s and s',
and length(s) is the number of characters in s. We determine and assign
similarity scores to each word in the abstracts by traversing through
each synonym in a given list. For each synonym s, we determine the
number of words n contained in s, and create sliding windows of sizes
ranging from $\lceil 0.5n \rceil$ to $\lfloor 1.5n \rfloor$ on the
abstract. For each string s' contained within each sliding window, we
assign SimScore(s, s') to each word w in s' unless one of the following
conditions is met: a) w has higher similarity to some other s'' in the
synonym list, b) s or s' has only one character, c) s or s'
case-insensitively matches any word in a list of common stop-words (see
Appendix A), or d) the first and last characters of s are not identical
to those of s'.
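The windowed scoring procedure above can be sketched in Python as
follows. This is a minimal illustration under stated assumptions, not
the authors' implementation: whitespace tokenization, the helper names
(`levenshtein`, `sim_score`, `is_excluded`, `ged_scores`), the reading
of the window bounds as ceil(0.5n) to floor(1.5n), and the
case-insensitive check for condition (d) are all choices made for the
example.

```python
import math

# Stop-words as listed in Appendix A.
STOP_WORDS = {
    "all", "an", "and", "are", "as", "at", "between", "but", "by", "can",
    "for", "from", "has", "in", "into", "is", "it", "less", "likely", "more",
    "most", "much", "not", "of", "on", "or", "per", "such", "that", "the",
    "through", "to", "via", "was", "we", "were", "whereas", "whole", "with",
}


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sim_score(s: str, t: str) -> float:
    """Equation (1), computed case-insensitively."""
    s, t = s.lower(), t.lower()
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def is_excluded(s: str, t: str) -> bool:
    """Conditions (b), (c), and (d) from the text."""
    s, t = s.lower(), t.lower()
    if len(s) <= 1 or len(t) <= 1:            # (b) single-character strings
        return True
    if s in STOP_WORDS or t in STOP_WORDS:    # (c) stop-words
        return True
    return s[0] != t[0] or s[-1] != t[-1]     # (d) first/last characters differ


def ged_scores(words, synonyms):
    """Return, for each word, the best SimScore of any window containing it."""
    scores = [0.0] * len(words)
    for synonym in synonyms:
        n = len(synonym.split())
        smallest = max(1, math.ceil(0.5 * n))  # assumed lower window bound
        largest = math.floor(1.5 * n)          # assumed upper window bound
        for size in range(smallest, largest + 1):
            for start in range(len(words) - size + 1):
                window = " ".join(words[start:start + size])
                if is_excluded(synonym, window):
                    continue
                score = sim_score(synonym, window)
                for k in range(start, start + size):
                    # Condition (a): keep only the highest score per word.
                    scores[k] = max(scores[k], score)
    return scores
```

Taking the maximum per word is one simple way to realize condition (a);
a production system would likely index the synonym list rather than
scan it linearly for every abstract.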

2.2 Soft Matching

Biological scientists often use novel variations of existing gene names
in their papers; thus, in order to match these names from abstracts to
the synonym list, we incorporate an approximate string matching
technique called soft matching, which identifies strings that are
similar but not necessarily identical. This method has been proven to
be useful [13]; however, our method is on the character-level instead
of word-level. Our soft matching is performed as follows: First, we
assign similarity scores to words in given abstracts using a given
synonym list, as described in 2.1. We then label all the longest
consecutive sequences of words that have similarity scores above a
given similarity threshold as a gene-protein entity name.
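The labeling step can be illustrated with a short Python sketch. It
assumes the per-word similarity scores have already been computed, and
it treats "above" as a strict inequality; the function name
`soft_match_spans` and the (start, end) span convention with an
exclusive end are assumptions made for the example.

```python
def soft_match_spans(scores, threshold):
    """Return maximal runs of consecutive words scoring above the threshold.

    Each run is reported as a (start, end) pair of word indices, end exclusive.
    """
    spans = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i              # a new run begins here
        elif start is not None:
            spans.append((start, i))   # the current run ends before word i
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans
```

For example, with scores `[0.1, 0.9, 0.8, 0.2, 0.95]` and a threshold
of 0.5, the words at positions 1-2 form one entity name and the word at
position 4 forms another.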

2.3  NER  on  YAPEX  &  GENIA 

We use an off-the-shelf machine learning system for NER called
Minorthird [6] for training our gene-protein NER system on the YAPEX
and GENIA corpora. We used Minorthird's default feature set, which
contains basic features such as word identity and capitalization
patterns. In addition, we used Minorthird's implementation of VP-HMM, a
voted-perceptron based training scheme for HMMs due to Collins [8].
VP-HMM is generally competitive with conditional random field (CRF)
learning methods, but converges more quickly. More specifically, as we
configured this learner, NER is reduced to the problem of classifying
each token as the beginning or continuation of a multi-token
gene-protein name, or as outside of any gene-protein name. We
configured the extractor to make 20 passes (epochs) over the training
data using a window size of three words.
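The reduction of NER to per-token classification can be sketched as
follows. This is an illustration only and does not use Minorthird; the
label names "B", "I", and "O" (beginning, continuation, outside) and
the function names are conventional choices assumed for the example.

```python
def encode_bio(num_tokens, entity_spans):
    """Turn (start, end) entity spans, end exclusive, into per-token labels."""
    labels = ["O"] * num_tokens
    for start, end in entity_spans:
        labels[start] = "B"                # beginning of a name
        for i in range(start + 1, end):
            labels[i] = "I"                # continuation of the same name
    return labels


def decode_bio(labels):
    """Recover (start, end) spans from per-token labels."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                spans.append((start, i))   # a new name closes the previous one
            start = i
        elif label == "I":
            if start is None:
                start = i                  # tolerate a continuation with no beginning
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans
```

A sequence learner then predicts one of the three labels per token, and
the predicted labels are decoded back into entity spans.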

The YAPEX dataset consists of a training corpus of 99 Medline abstracts
and a testing corpus of 101 Medline abstracts. These documents deal
primarily with protein-protein interactions, and are annotated for
gene-protein entities. We trained a VP-HMM extractor on the training
corpus of YAPEX using Minorthird's default features. The GENIA dataset
consists of a training corpus of 500 Medline abstracts and a testing
corpus of 300 Medline abstracts, mostly concerning cell signaling for
human cells. We trained a VP-HMM extractor on the training corpus of
GENIA using default features, plus protein-specific features described
elsewhere [18].

2.4  Single  Document  Repetition 

When a substring is identified as a named entity in a document, it is
highly likely that all other occurrences of that substring in the same
document are also named entities. Repetition of names in text


                                                      Mouse                              Fly
Configuration                                         Entity F1  Δ Baseline  Δ Complete  Entity F1  Δ Baseline  Δ Complete
Best Baseline                                         57.64      -           -19.28%     45.75      -           -26.09%
Complete System                                       71.41      23.89%      -           61.90      35.30%      -
Complete - Weak-train                                 71.12      23.39%      -0.41%      63.02      37.75%      1.81%
Complete - Filter                                     70.29      21.95%      -1.57%      63.54      38.89%      2.65%
Complete - SDR                                        67.45      17.02%      -5.55%      65.39      42.93%      5.64%
Complete - GED                                        66.67      15.67%      -6.64%      57.14      24.90%      -7.69%
Best System 1 (Complete)                              71.41      23.89%      -           61.90      35.30%      -
Best System 2 (Complete - Weak-train - Filter - SDR)  60.39      4.77%       -15.43%     66.41      45.16%      7.29%

Table 2. Summary of the performance of our best baseline system and
various configurations of our complete system for the mouse and fly
datasets. Configurations are derived by subtracting various components
from the complete system. Detailed results are presented in Table 3 for
the mouse and Table 4 for the fly.


has proven useful on many occasions [3, 17, 20, 25]. We incorporate a
post-processing step that exploits repetition of entity names within a
single document using the gene-protein names extracted by our trained
NER systems. More specifically, for each abstract, it collects all the
extracted names from that abstract, and soft-matches these names
against the words in the same abstract, using a constant threshold of
0.5 throughout our experiments; we refer to this as single document
repetition (SDR) labeling.
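A simplified Python sketch of SDR labeling is shown below. It is not
the authors' code: for brevity it slides a single window whose width
equals the number of words in each extracted name, and it omits the
stop-word and first/last-character checks of Section 2.1. The function
names and the whitespace tokenization are assumptions made for the
example.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sim_score(s, t):
    """Case-insensitive similarity from Equation (1)."""
    s, t = s.lower(), t.lower()
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def sdr_label(words, extracted_names, threshold=0.5):
    """Mark every word covered by a window that soft-matches an extracted name."""
    labeled = [False] * len(words)
    for name in extracted_names:
        size = len(name.split())
        if size == 0:
            continue
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size])
            if sim_score(name, window) > threshold:
                for k in range(start, start + size):
                    labeled[k] = True
    return labeled
```

With the extracted name "Abc1" and the words "Abc1 binds Xyz2 and abc-1
is", the sketch labels both "Abc1" and the variant "abc-1", which shows
how SDR can recover mentions the extractor missed.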

3  Approach 

3.1  Grounding  Weak  Labels 

In the BioCreAtIvE challenge, one unique characteristic of the datasets
is that there are synonym lists and weak labels. Therefore, for each
abstract, we can approximately locate gene names by soft-matching
synonyms of identifiers associated with that abstract against the words
in the same abstract. For this process, we used a fixed similarity
threshold of 0.5 for both the mouse and the fly datasets; we will refer
to this process as grounding the weak labels. The result of grounding
is a set of documents that are noisily annotated; a preliminary
evaluation of our grounding method on the evaluation data shows that
grounding gives an entity-level precision of 81%, recall of 65%, and F1
of 72% for the mouse dataset, and precision of 73%, recall of 70%, and
F1 of 71% for the fly dataset.
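What distinguishes grounding from the global dictionary matcher is the
lookup step: only synonyms of the identifiers associated with an
abstract are soft-matched against it. A minimal sketch of that lookup
follows. The dictionary layout, the function name, and the identifiers
"GENE:1" and "GENE:2" are hypothetical and chosen only for
illustration; they are not taken from the BioCreAtIvE data.

```python
def candidate_synonyms(abstract_id, weak_labels, synonym_list):
    """Collect, in order and without duplicates, the synonyms of every
    identifier weakly associated with one abstract."""
    seen = set()
    candidates = []
    for gene_id in weak_labels.get(abstract_id, []):
        for synonym in synonym_list.get(gene_id, []):
            key = synonym.lower()          # matching is case-insensitive
            if key not in seen:
                seen.add(key)
                candidates.append(synonym)
    return candidates


# Hypothetical identifiers and synonyms, for illustration only.
synonym_list = {"GENE:1": ["Abc1", "abc-1"], "GENE:2": ["Xyz2"]}
weak_labels = {"abstract-100": ["GENE:1"]}
print(candidate_synonyms("abstract-100", weak_labels, synonym_list))
```

The resulting candidates would then be passed to the soft matcher with
the 0.5 threshold, so "Xyz2" is never matched against this abstract
even if a similar string occurs in it.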


3.2  Sentence  Filtering 

Often genes that are mentioned but not associated with new results are
not curated. Sections of a document that discuss these genes will
become false negatives in our training set, i.e., they contain
substrings that should be annotated as protein names, but are not. One
method for eliminating (some of) these false negatives is to filter out
portions of the document that are likely to contain false negatives. We
thus incorporate a pre-processing step of filtering training examples:
specifically, we split abstracts into sentences (using a regular
expression), and then remove sentences in the training data that do not
contain any grounded gene-protein synonyms. We call this the sentence
filtering process. Recently, the same sentence filtering technique was
independently described by Vlachos and Gasperin [26].
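Sentence filtering can be sketched in a few lines of Python. The paper
does not give its regular expression, so the splitter below (break
after ".", "!", or "?" followed by whitespace) is an assumption, as is
the simplification of representing grounded synonyms as plain strings
found by case-insensitive substring search rather than as annotated
spans.

```python
import re

# Assumed sentence splitter: a boundary is whitespace preceded by . ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def filter_sentences(abstract, grounded_names):
    """Keep only the sentences that contain at least one grounded name."""
    sentences = SENTENCE_BOUNDARY.split(abstract.strip())
    lowered = [name.lower() for name in grounded_names]
    return [
        sentence
        for sentence in sentences
        if any(name in sentence.lower() for name in lowered)
    ]
```

Given the text "Abc1 regulates growth. The weather was cold. We cloned
abc1 again." and the grounded name "Abc1", only the first and third
sentences remain in the training data.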

Sentence-filtering will also remove many true negative examples; hence,
one might expect that sentence-filtering would lead to an over-general
extractor, and hence increase recall at the expense of precision.
Section 5 discusses methods to compensate for this bias.

4  Experiments 
4.1  Settings 

We trained a VP-HMM extractor on each of the following three datasets:
weak-train, curated, and a combined set, merged, which is the union of
curated and weak-train. Each of these datasets is weakly-labeled with
grounded gene-protein synonyms, using the approach described in 3.1.
Each trained extractor is evaluated with various combinations of
sentence filtering, SDR labeling, and GED features. These extractors
are evaluated on the evaluation data at the entity-level (i.e., no
partial credit is given for nearly-correct entity boundaries).

                           -SDR                      +SDR
                           Entity  Entity  Entity    Entity  Entity  Entity
Data        GED   Filter   Prec.   Recall  F1        Prec.   Recall  F1
YAPEX                      68.36   27.56   39.29     69.28   48.29   56.91
GENIA                      66.46   24.37   35.67     67.45   39.18   49.57
Dictionary                 50.34   67.43   57.64     47.56   66.51   55.46
C.V. Eval.  -GED           54.81   29.84   38.64     49.28   38.95   43.51
C.V. Eval.  +GED           59.05   53.53   56.15     54.75   60.36   57.42
Weak-train  -GED           82.39   26.65   40.28     78.76   34.62   48.10
Weak-train  +GED           78.47   48.97   60.31     75.58   59.23   66.41
Curated     -GED  -Filter  90.82   20.27   33.15     90.96   34.40   49.92
Curated     -GED  +Filter  74.67   38.27   50.60     71.97   60.82   65.93
Curated     +GED  -Filter  87.83   46.01   60.39     83.59   61.50   70.87
Curated     +GED  +Filter  80.91   56.95   66.84     76.10   66.74   71.12
Merged      -GED  -Filter  90.35   23.46   37.25     78.63   41.91   54.68
Merged      -GED  +Filter* 78.30   41.91   54.60     73.10   61.28   66.67
Merged      +GED  -Filter  87.40   50.57   64.07     84.13   60.36   70.29
Merged      +GED  +Filter* 79.57   58.54   67.45     75.90   67.43   71.41

Table 3. Performance of the four baselines (YAPEX, GENIA, Dictionary,
and C.V. Eval.) and our NER systems (Weak-train, Curated, Merged) at
entity-level tested on the mouse evaluation data. Bold F1 scores
represent scores that are higher than any corresponding baseline.
Extractors denoted by * will be tuned in section 5.

We compare our NER system's performance to four baselines: a) an
extractor trained on YAPEX, b) another trained on GENIA, c) 10-fold
cross validation on the evaluation data, and d) a global dictionary
soft-matcher which soft-matches every synonym from an entire synonym
list to the evaluation data (exact-matching was found to perform
worse). The similarity thresholds of the soft-matcher (0.85 for the
mouse dataset and 0.95 for the fly dataset) were pre-determined to
optimize the F1 measure on the evaluation data, so they are optimistic
assessments of the performance of this sort of technique. In addition
to the four baseline performances, we present our NER systems'
performance at the entity-level in Tables 3 and 4.
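Entity-level scoring with no partial credit can be made concrete with a
short Python sketch. The representation of an entity as a
(document, start, end) tuple and the function name `entity_prf` are
assumptions made for the example; the paper's evaluation code is not
shown in the text.

```python
def entity_prf(predicted, gold):
    """Entity-level precision, recall, and F1 with exact-boundary matching.

    Both arguments are collections of (doc_id, start, end) tuples; a predicted
    entity counts as correct only if an identical tuple appears in gold.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under this scheme a prediction that overshoots a boundary by one token,
such as ("d1", 5, 7) against the gold span ("d1", 5, 6), counts as both
a false positive and a false negative.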


4.2  Results 

None of the baseline methods is competitive with the complete system
(including GED features, SDR, and sentence filtering) trained on the
largest weakly-labeled dataset (merged). Table 2 shows a summary of our
experimental results. For mouse, the complete system obtains an F1 of
71.4% and the best baseline (soft-match to the dictionary) obtains an
F1 of 57.6%; for fly, the complete system obtains an F1 of 61.9%, and
the best baseline (a YAPEX-trained system) obtains an F1 of only 45.8%.
Table 2 also shows the relative improvement in F1 between the complete
systems and the best baseline: the improvement is nearly 24% for mouse,
and more than 35% for fly.
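As a worked check, and assuming that "relative improvement" means the
change in F1 divided by the best baseline's F1 (an interpretation that
reproduces the percentages reported in Table 2):

```latex
\Delta_{\text{mouse}} = \frac{71.41 - 57.64}{57.64} \approx 0.2389,
\qquad
\Delta_{\text{fly}} = \frac{61.90 - 45.75}{45.75} \approx 0.3530
```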

Table 2 also shows the results for training on only the curated data
(in the row labeled "Complete - Weak-train"); for training without
sentence-filtering (row "Complete - Filter"); for training without SDR;
and for training without the GED features. Each of these ablations
performs worse on the mouse data, although the effects are small for
"Complete - Filter" and "Complete - Weak-train". For fly, the trends
are less clear: removing the GED features clearly leads to lower
performance, but removing SDR results in noticeably higher performance,
and removing sentence-filtering or the (57 document) weak-train dataset
also leads to slight improvements in performance.

                           -SDR                      +SDR
                           Entity  Entity  Entity    Entity  Entity  Entity
Data        GED   Filter   Prec.   Recall  F1        Prec.   Recall  F1
YAPEX                      66.00   23.32   34.46     68.79   34.28   45.75
GENIA                      44.16   12.01   18.89     59.06   26.50   36.59
Dictionary                 28.92   70.32   40.99     27.75   70.32   39.80
C.V. Eval.  -GED           39.13   9.54    15.34     46.38   22.61   30.40
C.V. Eval.  +GED           37.59   36.40   36.98     35.68   46.64   40.43
Weak-train  -GED           37.50   3.18    5.86      38.46   7.07    11.94
Weak-train  +GED           51.89   38.87   44.44     56.79   57.60   57.19
Curated     -GED  -Filter  78.43   28.27   41.56     75.71   47.35   58.26
Curated     -GED  +Filter  63.64   44.52   52.39     55.25   57.60   56.40
Curated     +GED  -Filter  73.19   60.78   66.41     64.31   64.31   64.31
Curated     +GED  +Filter  65.73   66.43   66.08     57.82   69.26   63.02
Merged      -GED  -Filter  78.26   31.80   45.23     74.30   47.00   57.58
Merged      -GED  +Filter* 64.47   44.88   52.92     55.70   58.66   57.14
Merged      +GED  -Filter  70.76   59.01   64.35     62.46   64.66   63.54
Merged      +GED  +Filter* 64.38   66.43   65.39     56.20   68.90   61.90

Table 4. Performance of the four baselines (YAPEX, GENIA, Dictionary,
and C.V. Eval.) and our NER systems (Weak-train, Curated, Merged) at
entity-level tested on the fly evaluation data. Bold F1 scores
represent scores that are higher than any corresponding baseline.
Extractors denoted by * will be tuned in section 5.

The last two rows of Table 2 report performance of the system that uses
the best combination of techniques, as suggested by these ablation
studies. For mouse, this is the complete system; for fly, it is the
system trained on the curated data only, with GED features, but without
SDR and sentence filtering. This system achieves a 45% improvement over
the best baseline.

Tables 3 and 4 also show the results of every combined system. In the
mouse dataset, the weakly-trained NER systems outperform the best
baseline whenever they are trained with GED, or whenever they include
sentence-filtering and SDR. For the fly dataset, our NER systems almost
always outperform all baselines. For the mouse dataset, filtering, SDR,
and GED always improve F1, and the maximum F1 measure of 71.4% is
obtained when all three methods are combined. For the fly dataset, only
GED is always effective, and SDR is effective only when not combined
with GED. We conjecture that when precision is high and recall is low,
SDR is more likely to label false negatives than true negatives as
gene-protein names.

The maximum F1 score on the fly dataset was obtained on the unfiltered
curated data; however, the performance of the nearly-complete system
(with GED and filtering) trained on the largest (merged) dataset is
similar (65.4%). We conjecture that for future weak-training problems
competitive performance can be obtained by either the complete system,
or the complete system without SDR.

5  Extractor  Tuning 
5.1  Method 

The sentence-filtering method described above increases recall at the
expense of precision, which may not be appropriate for all text-mining
applications. In general, one would like to be able to adjust the
recall-precision tradeoff of an NER system to suit the user's needs;
for instance, curators of biological databases might prefer a
high-recall gene-protein name extractor to assist them in identifying
most gene-protein candidate names. To create such an extractor we tune
or tweak [21] the threshold term of some of our trained extractors
(those marked with * in Tables 3 and 4) on the word-level recall of the
tuning data weak-train (which is less noisy than curated). We pick the
threshold term that optimizes the complete F-measure formula for a
user-chosen β value:


Figure 1. Tweaking extractors trained on the mouse dataset for β values
from 0.1 to 10 on the word-level recall of weak-train data. The four
baselines are also shown. Merged was filtered, and all extractions were
SDR labeled.


Figure 2. Tweaking extractors trained on the fly dataset for β values
from 0.1 to 10 on the word-level recall of weak-train data. The four
baselines are also shown. Merged was filtered, and all extractions were
SDR labeled.


$$F_\beta(P, R) = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R} \qquad (2)$$

Here P is word-level precision and R is word-level recall. A β value
greater than 1 assigns higher importance to recall; for instance, F2
weights recall twice as much as precision. These tweaked extractors are
then evaluated on the evaluation data.
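The tuning loop can be sketched in Python under a strong simplifying
assumption: each token receives an independent real-valued score, and a
token is labeled as part of a name when its score exceeds the
threshold. The method of [21] adjusts a threshold term inside the
trained extractor instead, so the function names, the candidate grid,
and the per-token scoring below are illustrative only.

```python
def f_beta(precision, recall, beta):
    """Equation (2); returns 0.0 when the denominator is zero."""
    denom = beta ** 2 * precision + recall
    return (beta ** 2 + 1) * precision * recall / denom if denom > 0 else 0.0


def word_level_pr(predicted, gold):
    """Word-level precision and recall from parallel lists of booleans."""
    overlap = sum(p and g for p, g in zip(predicted, gold))
    n_pred, n_gold = sum(predicted), sum(gold)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return precision, recall


def tune_threshold(token_scores, gold, beta, candidates):
    """Return the candidate threshold with the highest F-beta on tuning data."""
    best_score, best_threshold = -1.0, None
    for threshold in candidates:
        predicted = [score > threshold for score in token_scores]
        precision, recall = word_level_pr(predicted, gold)
        score = f_beta(precision, recall, beta)
        if score > best_score:
            best_score, best_threshold = score, threshold
    return best_threshold
```

A small β favors thresholds that give high precision, while a large β
favors thresholds that give high recall, which is the behavior the
tweaked extractors are meant to expose.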

Word-level precision is the number of words (tokens) that are part of
both a predicted entity name and an actual entity name, divided by the
number of words that are part of a predicted entity name; word-level
recall divides the same count by the number of words that are part of
an actual entity name. Use of word-level precision and recall rather
than entity-level precision and recall gives some credit to
nearly-correct entity boundaries: for instance, an extractor that
extends slightly past an entity boundary will receive credit for word
recall, but be penalized for word precision.
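The boundary example can be checked with a small Python sketch. The
span representation (start, end) with an exclusive end and the function
names are assumptions made for the illustration.

```python
def span_tokens(spans):
    """Expand (start, end) spans, end exclusive, into a set of token indices."""
    return {i for start, end in spans for i in range(start, end)}


def word_level_pr_from_spans(predicted_spans, gold_spans):
    """Word-level precision and recall computed over token indices."""
    predicted = span_tokens(predicted_spans)
    gold = span_tokens(gold_spans)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall
```

If the actual name covers tokens 2-3 and the extractor predicts tokens
2-4, word-level recall is 1.0 and word-level precision is 2/3, whereas
entity-level scoring would count the prediction as entirely wrong.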


5.2 Results

Figures 1 (mouse) and 2 (fly) each show two precision-recall curves at
the word-level: one is a curve of tweaked extractors trained without
GED features and the other with GED features. Each data point on a
curve represents an extractor tweaked for a different β value (0.1,
0.2, ..., 0.9, 1, 2, ..., 10), trained on filtered examples, with its
extractions SDR labeled.

For comparison, we also show the four baselines: a YAPEX-trained NER
system, a GENIA-trained NER system, soft matching using the dictionary,
and 10-fold cross validation (with and without GED features). As
expected, the higher the β value, the higher the word-level recall of
the resulting tweaked extractor. Interestingly, while including the GED
features always improves F1, it also appears to limit the degree to
which precision can be traded off for recall. We were able to generate
a high-recall and medium-precision extractor, tweaked for β = 3 without
GED features, that has a word-level precision, recall, and F1 of about
58%, 87%, and 70% respectively for the mouse dataset and 48%, 78%, and
59% respectively for the fly dataset.

6 Related Work


The identification of gene-protein names has received substantial
attention in the bioinformatics community. Some prior research involves
training an extractor on weakly-labeled gene-protein synonyms; for
instance, Hachey et al. [12] automatically labeled gene text fragments
by identifying potential genes using regular expression fuzzy matching,
and then trained a tagger for each organism. The most closely related
prior work is that of Morgan et al. [22], who performed pattern
matching to find candidate mentions in FlyBase abstracts using synonym
lists and trained an HMM-based tagger on these noisy training data,
achieving an F1 of 67% with 522,825 tokens of training data and an F1
of 75% with 1,342,039 tokens of training data.

There are several additional contributions of this work. Unlike Morgan
et al., we study the generality of weak-labeling methods (our system is
the same for FlyBase and MGI). We also study the use of intra-document
repetition, and its effect on weakly-trained NER systems, alone and in
combination with other methods. We also study the effect of sentence
filtering, and the effect of GED (dictionary) features on the range of
points reachable on a recall-precision curve. Our F1 performance for
the fly data with 1057 abstracts is comparable to that obtained by
Morgan et al. with 522,825 tokens (approximately 2000-2500 abstracts).
However, Morgan et al. exploited orthographic preprocessing steps that
we did not use, and studied the effect of using much larger training
sets. (Unfortunately we cannot compare directly on the same test set,
due to technical issues involving tokenization.)

Some other prior related research involves unsupervised identification
of gene-protein names. Wellner [27] incorporates part-of-speech as
factors for proposing gene phrases and performs exact matching from a
synonym list to abstracts for annotating candidate gene-protein
synonyms. Cohen [5] generates orthographic variants of gene-protein
entities, separates out regular English words by using English word
dictionaries, and matches the remaining variants against biomedical
abstracts.

The contribution of this paper is to explore and systematically
evaluate several different techniques, in isolation and in combination,
for the gene-protein NER task: sentence filtering, GED features [24],
SDR labeling [20], training on weakly-labeled examples [22], and tuning
trained extractors [21]. We also contribute to the community, for each
of the fly and mouse organisms, two organism-specific gene-protein name
extractors; one has high precision but medium recall and the other high
recall but medium precision.


7  Conclusions 

Manually annotated training data has always been difficult to produce.
This is especially true for biomedical data, because expertise in
biology is required to annotate gene-protein names. In this paper, we
trained a gene-protein NER system, without manually annotating any
documents, by using the mouse and fly datasets from BioCreAtIvE task
1B. We presented an automatic approach for creating training corpora by
soft matching gene synonyms into abstracts. We illustrated that the NER
systems trained on these annotated abstracts, combined with sentence
filtering, SDR labeling, and/or GED features, can outperform all
baselines. Furthermore, we also demonstrated the possibility of
converting a gene-protein NER system with decent performance into a
high-recall gene-protein name extractor. Our results demonstrate that
the quality of named entity recognition systems can be significantly
improved through the use of readily available data, thus avoiding the
difficult process of manually annotating training sets.
Appendix A. Stop-Words

List of common English words that are used as stop-words in our system:
all, an, and, are, as, at, between, but, by, can, for, from, has, in,
into, is, it, less, likely, more, most, much, not, of, on, or, per,
such, that, the, through, to, via, was, we, were, whereas, whole, with.


^  Available  from  http://rcwang.com/pub/GeneNER.tar.gz 